Limits to Depth-Efficiencies of Self-Attention

Neural Information Processing Systems

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth).




6cfe0e6127fa25df2a0ef2ae1067d915-Paper.pdf

Neural Information Processing Systems

However, maximum-margin classifiers are inherently robust to perturbations of data at prediction time, and this implication is at odds with concrete evidence that neural networks, in practice, are brittle to adversarial examples [71] and distribution shifts [52, 58, 44, 65]. Hence, the linear setting, while convenient to analyze, is insufficient to capture the non-robustness of neural networks trained on real datasets. Going beyond the linear setting, several works [1, 49, 74] argue that neural networks generalize well because standard training procedures have a bias towards learning
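The robustness claim for maximum-margin classifiers in the snippet above can be checked on a toy linear problem. The sketch below is illustrative only and is not taken from the paper: the data, the separator `w`, `b`, and the helper names `margin` and `worst_case_predictions` are all assumptions. It shows that a linear classifier with geometric margin gamma keeps every training prediction under any L2 perturbation of norm below gamma, and can be flipped once the perturbation exceeds gamma.

```python
import numpy as np

def margin(w, b, X, y):
    """Smallest signed distance from any labelled point to the hyperplane w.x + b = 0."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# For this toy set the closest opposite-class points are (2, 0) and (-2, 0),
# so the maximum-margin separator is the line x_1 = 0.
w = np.array([1.0, 0.0])
b = 0.0

gamma = margin(w, b, X, y)  # geometric margin of the separator

def worst_case_predictions(eps):
    """Predictions after moving every point eps units straight toward the boundary."""
    delta = -eps * y[:, None] * w / np.linalg.norm(w)
    return np.sign((X + delta) @ w + b)

# Perturbations smaller than the margin leave all predictions unchanged;
# perturbations larger than the margin flip the closest points.
robust = np.array_equal(worst_case_predictions(0.9 * gamma), y)
broken = np.array_equal(worst_case_predictions(1.1 * gamma), y)
```

The perturbation used here is the worst case for a linear model, since moving along `-y * w` reduces the signed distance to the boundary fastest. This guarantee only covers the linear setting, which is why the snippet describes that setting as insufficient for neural networks.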





1 Appendix 1 Bayes-by-backprop The Bayesian posterior neural network distribution P(w | D) is approximated

Neural Information Processing Systems

In Algorithm 1 we give the full clustering algorithm used for each of the T fixing iterations. In Figure 1 we show how the layers' In Figure 2 we show the impact of increasing the regularisation strength.